author: "Del Middlemiss"
date: "15th August 2019"

Learning objectives: the +, * and : operators in formula notation
New packages: mosaicData, GGally, ggiraphExtra
Duration: 120 minutes
We’re going to introduce the elements of multiple linear regression one by one to help you understand their use and effect. Our starting point will be simple linear regression with a response variable \(y\) and a continuous predictor \(x\). This is just the sort of linear regression scenario we looked at in module 2! We’ll then look at adding:

* a categorical predictor (giving a ‘parallel slopes’ model)
* an interaction between predictors
* a second continuous predictor
RailTrail data

Let’s have a look at a dataset with multiple predictors. The RailTrail data from the mosaicData package contains 90 daily observations of the volume of users on a ‘rail trail’ (i.e. a former rail line converted to a path), together with a number of other factors: the average daily temperature, whether that day was a weekday, the precipitation, the cloud cover, and so on. A laser counting station was used to gather the user volume values, coupled to a set of weather sensors for the other factors.
We’re going to build a regression model for user volume as a function of one or more predictor variables.
library(mosaicData)
library(tidyverse)
glimpse(RailTrail)
## Observations: 90
## Variables: 11
## $ hightemp <int> 83, 73, 74, 95, 44, 69, 66, 66, 80, 79, 78, 65, 41, 5…
## $ lowtemp <int> 50, 49, 52, 61, 52, 54, 39, 38, 55, 45, 55, 48, 49, 3…
## $ avgtemp <dbl> 66.5, 61.0, 63.0, 78.0, 48.0, 61.5, 52.5, 52.0, 67.5,…
## $ spring <int> 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1,…
## $ summer <int> 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0,…
## $ fall <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,…
## $ cloudcover <dbl> 7.6, 6.3, 7.5, 2.6, 10.0, 6.6, 2.4, 0.0, 3.8, 4.1, 8.…
## $ precip <dbl> 0.00, 0.29, 0.32, 0.00, 0.14, 0.02, 0.00, 0.00, 0.00,…
## $ volume <int> 501, 419, 397, 385, 200, 375, 417, 629, 533, 547, 432…
## $ weekday <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FAL…
## $ dayType <chr> "weekday", "weekday", "weekday", "weekend", "weekday"…
Task - 3 mins
Take a few minutes to acquaint yourself with the RailTrail data:
* perhaps try running summary()
* plot volume against a few predictors that you think may be of interest
Solution
summary(RailTrail)
## hightemp lowtemp avgtemp spring
## Min. :41.00 Min. :19.00 Min. :33.00 Min. :0.0000
## 1st Qu.:59.25 1st Qu.:38.00 1st Qu.:48.62 1st Qu.:0.0000
## Median :69.50 Median :44.50 Median :55.25 Median :1.0000
## Mean :68.83 Mean :46.03 Mean :57.43 Mean :0.5889
## 3rd Qu.:77.75 3rd Qu.:53.75 3rd Qu.:64.50 3rd Qu.:1.0000
## Max. :97.00 Max. :72.00 Max. :84.00 Max. :1.0000
## summer fall cloudcover precip
## Min. :0.0000 Min. :0.0000 Min. : 0.000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 3.650 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median : 6.400 Median :0.00000
## Mean :0.2778 Mean :0.1333 Mean : 5.807 Mean :0.09256
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.: 8.475 3rd Qu.:0.02000
## Max. :1.0000 Max. :1.0000 Max. :10.000 Max. :1.49000
## volume weekday dayType
## Min. :129.0 Mode :logical Length:90
## 1st Qu.:291.5 FALSE:28 Class :character
## Median :373.0 TRUE :62 Mode :character
## Mean :375.4
## 3rd Qu.:451.2
## Max. :736.0
RailTrail %>% ggplot(aes(x = avgtemp, y = volume)) + geom_point()
RailTrail %>% ggplot(aes(x = weekday, y = volume)) + geom_boxplot()
RailTrail %>% ggplot(aes(x = precip, y = volume)) + geom_point()
ggpairs() in the GGally package

Some of the predictors look redundant, so let’s trim them out. We’ll only use a single measure of temperature (say avgtemp), and we can also remove fall and dayType, as they can be computed from other predictors (i.e. they are aliases).
RailTrail_trim <- RailTrail %>%
select(-c("hightemp", "lowtemp", "fall", "dayType"))
Now we can simply plot the trimmed dataframe to see a useful basic visualisation:
plot(RailTrail_trim)
or we can obtain an enhanced version of the same plot using the ggpairs() function in the GGally package!
library(GGally)
ggpairs(RailTrail_trim)
We should examine these plots to look for predictors that appear significantly associated with our response variable (volume in this case). We should also take a keen interest in how the predictors correlate with each other.
Examining the pairs plots above, we see a reasonably strong association between the user volume each day and avgtemp, the average daily temperature (degrees Fahrenheit). Let’s run simple linear regression as follows:
Let’s start by regressing volume on avgtemp. We’re hypothesising that people are more likely to use the rail trail in warm weather.
RailTrail_trim %>%
ggplot(aes(x = avgtemp, y = volume)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
Task - 5 mins
Just to get back in the ‘groove’, run a simple linear regression on the RailTrail data, taking volume as response and avgtemp as predictor. Check the regression assumptions. How well does this simple model perform in predicting volume? [Hint - remember, we can check the regression assumptions by plotting the model object]
Solution
model <- lm(volume ~ avgtemp, data = RailTrail_trim)
par(mfrow = c(2, 2))
plot(model)
summary(model)
##
## Call:
## lm(formula = volume ~ avgtemp, data = RailTrail_trim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -237.53 -80.00 8.49 55.14 326.67
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99.602 63.474 1.569 0.12
## avgtemp 4.802 1.084 4.428 2.72e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 115.9 on 88 degrees of freedom
## Multiple R-squared: 0.1822, Adjusted R-squared: 0.1729
## F-statistic: 19.61 on 1 and 88 DF, p-value: 2.723e-05

The regression doesn’t seem too bad, although there is some evidence of systematic variation not captured in the Residuals vs Fitted plot, some deviation from normality in the high quantile residuals in the Normal Q-Q plot, and evidence of rather mild heteroscedasticity in the Scale-Location plot.

Alas, however, the model isn’t very effective. The \(r^2\) value is \(0.18\), and the residual standard error is \(115.9\). To put the latter in context, let’s see the boxplot of volume.

RailTrail_trim %>% ggplot(aes(y = volume)) + geom_boxplot()

The median is just below \(400\) users, and our estimated volume values are accurate to only \(116\) users on average (we get this from the residual standard error)! So our estimates are out by around \(25\%\) of the typical user volume.
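To see where the residual standard error comes from, here is a sketch computing it by hand as \(\sqrt{RSS / df}\). It uses synthetic data (the variable names are illustrative, not from RailTrail), since the calculation only needs a fitted lm object:

```r
# Sketch: computing the residual standard error by hand (synthetic data)
set.seed(1)
x <- runif(90, 30, 90)                   # synthetic 'avgtemp'-like predictor
y <- 100 + 5 * x + rnorm(90, sd = 100)   # synthetic 'volume'-like response

model <- lm(y ~ x)

# Residual standard error = sqrt(RSS / df), where df = n - number of coefficients
rss      <- sum(residuals(model)^2)
df_resid <- length(y) - length(coef(model))   # 90 - 2 = 88
rse      <- sqrt(rss / df_resid)

# This matches the 'Residual standard error' line printed by summary(model)
all.equal(rse, summary(model)$sigma)
```

The same check works on any lm fit, including the RailTrail models above.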
We can hopefully do better by adding further predictors to the model, taking us into the territory of multiple linear regression! Adding predictors might well also fix some of the rather mild breaches of the regression assumptions we have observed. Fingers crossed!
We’ll start by adding a categorical predictor variable, yielding what’s known as a parallel slopes model.
Let’s hypothesise that weekday has an effect on usage of the rail trail. We could probably argue the effect in either direction: that weekdays increase volume due to use by people commuting to work, or perhaps that weekends are busier due to recreational use. We shall see…
Task - 2 mins
Try plotting an appropriate visualisation to determine whether user volume is associated with the weekday predictor.
Solution
Two boxplots split by weekday are appropriate here.

RailTrail_trim %>% ggplot(aes(x = weekday, y = volume)) + geom_boxplot()
The boxplots show clearly different distributions for the two levels of weekday, so we can conclude there is an association between the two variables.
To be more specific, let’s calculate the point-biserial correlation coefficient (this is just the normal Pearson correlation coefficient applied to the case of one continuous- and one dichotomous variable).
RailTrail_trim %>% summarise(cor = cor(as.numeric(weekday), volume))

This confirms a weak negative correlation.
Systematic model development
We are not really following a process of systematic model development here. We’ll detail that process in the next lesson: for now, we are content just to discuss the ‘ingredients’ of multiple linear regression.
If we wished to be systematic about adding a predictor to our model beyond avgtemp, we would need to see which predictor is most strongly associated with the response volume after the systematic variation due to avgtemp has been subtracted. We would do this iteratively:

* fit the current model
* obtain its residuals
* see which remaining predictor is most strongly associated with those residuals
* add that predictor to the model and refit

We would keep cycling in this way until we are satisfied with the model.
Let’s add weekday into our regression model!
\[\widehat{\textrm{volume}} = \textrm{intercept} + b_{\textrm{avgtemp}} \times \textrm{avgtemp} + b_{\textrm{weekday}} \times \textrm{weekday}\]
In R formula notation, this is just as simple as saying volume ~ avgtemp + weekday
model2 <- lm(volume ~ avgtemp + weekday, data = RailTrail_trim)
The \(+\) operator in formula notation
Note that \(+\) in this formula doesn’t mean ‘add the two variables together’ here, i.e. the regression equation is not something like
\[\widehat{\textrm{volume}} = \textrm{intercept} + b \times (\textrm{avgtemp} + \textrm{weekday})\] Instead, read the \(+\) operator in a formula as meaning ‘add predictor weekday into the model with its own coefficient’. Similarly, the \(*\) operator has a new meaning that we’ll see below (along with a new one: the \(:\) operator).
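To see concretely what \(+\) does, we can inspect the design matrix R builds from a formula. This is a sketch on a tiny hypothetical dataframe (the names d, x and g are made up for illustration):

```r
# Sketch: how R expands 'response ~ x + g' into a design matrix
d <- data.frame(
  x = c(1.0, 2.0, 3.0, 4.0),
  g = c(TRUE, FALSE, TRUE, FALSE)   # a logical predictor, like 'weekday'
)

# Each predictor gets its own column (and hence its own coefficient);
# '+' does not add the variables' values together.
model.matrix(~ x + g, data = d)
# columns: (Intercept), x, gTRUE
```

Note how the logical g is expanded into a single dummy column gTRUE, exactly as we’ll see R do for weekday below.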
Let’s see the summary() and diagnostics of the model
summary(model2)
##
## Call:
## lm(formula = volume ~ avgtemp + weekday, data = RailTrail_trim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -242.92 -75.72 3.15 66.90 280.06
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 163.022 65.428 2.492 0.01461 *
## avgtemp 4.541 1.050 4.323 4.08e-05 ***
## weekdayTRUE -70.320 25.566 -2.751 0.00724 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 111.8 on 87 degrees of freedom
## Multiple R-squared: 0.2476, Adjusted R-squared: 0.2303
## F-statistic: 14.32 on 2 and 87 DF, p-value: 4.214e-06
par(mfrow = c(2,2))
plot(model2)
Note how R has treated the weekday variable: it has split it into a single dummy variable weekdayTRUE, with weekdayFALSE being absorbed into the intercept. So weekdayFALSE is being treated as the reference level.

Task - 2 mins
How should we interpret the value of the coefficient for weekdayTRUE? Have a think about this, and discuss it with the people around you.
Solution
In the following way: on average, there are approximately \(70\) fewer users on the rail trail each weekday as compared with a Saturday or a Sunday (with avgtemp held constant: we’ll discuss this more fully below).
Why is this called a ‘parallel slopes model’? The easiest way to answer this is by visualising the model. The ggPredict() function in the ggiraphExtra package can help with this!
library(ggiraphExtra)
ggPredict(model2, interactive = TRUE)
So we see the model effectively has two lines: one for weekdays, and another for weekend days. The lines are parallel (their slopes are the same), so we call this a ‘parallel slopes model’.
How do these two lines come about? Let’s look again at the regression equation:
\[\widehat{\textrm{volume}} = \textrm{intercept} + b_{\textrm{avgtemp}} \times \textrm{avgtemp} + b_{\textrm{weekday}} \times \textrm{weekday}\]
The slope of the line with respect to avgtemp is \(b_\textrm{avgtemp}\), and remember that weekday is categorical. So for weekday = TRUE values, the line is effectively
\[\widehat{\textrm{volume}} = (\textrm{intercept} + b_{\textrm{weekday}}) + b_{\textrm{avgtemp}} \times \textrm{avgtemp}\]
i.e. just a shift in the intercept, while for weekday = FALSE values, we get
\[\widehat{\textrm{volume}} = \textrm{intercept} + b_{\textrm{avgtemp}} \times \textrm{avgtemp}\]
The lines differ only in intercept; they are parallel.
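We can verify this numerically: in a parallel slopes fit, both groups share the same slope coefficient, and the dummy coefficient shifts only the intercept. This sketch uses synthetic data with variable names chosen to mimic (but not come from) RailTrail:

```r
# Sketch: a parallel slopes model on synthetic data
set.seed(42)
temp    <- runif(100, 30, 90)
weekday <- rep(c(TRUE, FALSE), 50)
volume  <- 160 + 4.5 * temp - 70 * weekday + rnorm(100, sd = 10)

fit <- lm(volume ~ temp + weekday)
b   <- coef(fit)   # (Intercept), temp, weekdayTRUE

# Line for weekday = FALSE: intercept = b[1],        slope = b[2]
# Line for weekday = TRUE:  intercept = b[1] + b[3], slope = b[2] (the same!)
b
```

Both fitted lines use the single slope b["temp"]; only their intercepts differ by b["weekdayTRUE"].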
Task - 5 mins
Try adding the summer categorical predictor to the existing model with avgtemp and weekday.

\[\widehat{\textrm{volume}} = \textrm{intercept} + b_{\textrm{avgtemp}} \times \textrm{avgtemp} + b_{\textrm{weekday}} \times \textrm{weekday} + b_{\textrm{summer}} \times \textrm{summer}\]
- How many lines do you expect to see in this model?
- Is this a parallel slopes model? [Hint: try ggPredict() on the model object]
- Is the addition of this predictor justified? [Hint: what is the p-value of \(b_{\textrm{summer}}\)]
Solution
We expect four lines in this case, as each of weekday and summer can take \(2\) values, so \(2 \times 2 = 4\). We still have a parallel slopes model, as summer is also categorical, so the four lines will be parallel (only their intercepts will differ).

Now let’s run the regression and diagnostics:

model3 <- lm(volume ~ avgtemp + weekday + summer, data = RailTrail_trim)
summary(model3)
##
## Call:
## lm(formula = volume ~ avgtemp + weekday + summer, data = RailTrail_trim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -236.165 -75.448 6.184 68.496 252.680
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.376 90.336 0.790 0.431630
## avgtemp 6.387 1.639 3.898 0.000192 ***
## weekdayTRUE -66.958 25.505 -2.625 0.010244 *
## summer -59.977 41.052 -1.461 0.147662
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 111.1 on 86 degrees of freedom
## Multiple R-squared: 0.2659, Adjusted R-squared: 0.2402
## F-statistic: 10.38 on 3 and 86 DF, p-value: 6.714e-06
par(mfrow = c(2, 2))
plot(model3)
So, we find only a pretty small improvement in \(r^2\) from \(0.248\) to \(0.266\), and the p-value of the summer coefficient indicates that the predictor is not significant!

Next, let’s plot the model.

ggPredict(model3, interactive = TRUE)

We find four parallel lines, as expected! However, we’ll leave summer out of the model, as the p-value of its coefficient indicates that it is not significant.

Why did summer turn out not to be a significant predictor of volume? Likely because avgtemp and summer are correlated:

RailTrail_trim %>% summarise(cor = cor(avgtemp, summer))

so it is possible that we gain little extra information in the model by including summer along with avgtemp. In other words, high average temperatures are likely already a strong enough basis to indicate a summer’s day.
Let’s try to interpret the following plot:
RailTrail_trim %>%
ggplot(aes(x = avgtemp, y = volume, color = weekday)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
What we see is equivalent to grouping the data by weekday and obtaining best-fit lines of volume versus avgtemp for each group of data independently. It’s clear that the slopes and intercepts of these best-fit lines vary depending on whether weekday is TRUE or FALSE.
How can we add this behaviour into our model? By adding an interaction between avgtemp and weekday.
\[\widehat{\textrm{volume}} = \textrm{intercept} + b_{\textrm{avgtemp}} \times \textrm{avgtemp} + b_{\textrm{weekday}} \times \textrm{weekday} + b_{\textrm{avgtemp:weekday}} \times \textrm{avgtemp} \times \textrm{weekday}\]
The last term here is the interaction. Notice that it involves the product of the two predictors: \(\textrm{avgtemp} \times \textrm{weekday}\).
We do this in R formula notation with the \(:\) operator as follows:
model4 <- lm(volume ~ avgtemp + weekday + avgtemp:weekday, data = RailTrail_trim)
We can read \(:\) as ‘…interacting with…’, so + avgtemp:weekday can be read as ‘add avgtemp interacting with weekday’. Let’s visualise the model
ggPredict(model4, interactive = TRUE)
This looks promising! It’s identical to the graph we plotted earlier!
Alternatively, we could have equivalently specified model4 as
alt_model4 <- lm(volume ~ avgtemp * weekday, data = RailTrail_trim)
ggPredict(alt_model4, interactive = TRUE)
We can read \(*\) as ‘these predictors and all possible interactions between them’. For the moment, though, it’s probably easier to stick to using \(:\) explicitly to include interactions.
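A quick way to confirm that \(*\) expands to the main effects plus the interaction is to compare the term labels of the two formulas. This works on bare formulas (a and b here are placeholder names, no data needed):

```r
# Sketch: 'a * b' in a formula expands to 'a + b + a:b'
t1 <- terms(y ~ a * b)
attr(t1, "term.labels")
# "a"   "b"   "a:b"

t2 <- terms(y ~ a + b + a:b)
identical(attr(t1, "term.labels"), attr(t2, "term.labels"))
# TRUE
```

So volume ~ avgtemp * weekday and volume ~ avgtemp + weekday + avgtemp:weekday specify exactly the same model.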
Task - 2 mins
Examine the summary of the model including the interaction between avgtemp and weekday. Is our inclusion of the interaction justified?
Solution
summary(model4) # or summary(alt_model4)
##
## Call:
## lm(formula = volume ~ avgtemp + weekday + avgtemp:weekday, data = RailTrail_trim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -269.906 -78.862 2.064 68.190 291.641
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 285.887 103.390 2.765 0.00696 **
## avgtemp 2.457 1.718 1.431 0.15619
## weekdayTRUE -262.193 128.181 -2.045 0.04386 *
## avgtemp:weekdayTRUE 3.300 2.161 1.527 0.13040
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 111 on 86 degrees of freedom
## Multiple R-squared: 0.2675, Adjusted R-squared: 0.2419
## F-statistic: 10.47 on 3 and 86 DF, p-value: 6.115e-06
par(mfrow = c(2, 2))
plot(model4)

Alas, including this interaction is not justified: the p-value of its coefficient, at approximately \(0.13\), is greater than our significance level of \(0.05\).
Why does the interaction term lead to different slopes and intercepts?
Why does including the avgtemp:weekday interaction term lead to different slopes and intercepts for weekday = TRUE and weekday = FALSE?
To see this, let’s rearrange the regression equation:
\[\widehat{\textrm{volume}} = (\textrm{intercept} + b_{\textrm{weekday}} \times \textrm{weekday}) + (b_{\textrm{avgtemp}} + b_{\textrm{avgtemp:weekday}} \times \textrm{weekday}) \times \textrm{avgtemp}\]
The first term in parentheses \(()\) is the overall ‘intercept’, and the second term in parentheses is the overall ‘slope’, and we see now that these depend on weekday!
| case | ‘intercept’ | ‘slope’ |
|---|---|---|
| weekday = FALSE (or 0) | \(\textrm{intercept}\) | \(b_{\textrm{avgtemp}}\) |
| weekday = TRUE (or 1) | \(\textrm{intercept} + b_{\textrm{weekday}}\) | \(b_{\textrm{avgtemp}} + b_{\textrm{avgtemp:weekday}}\) |
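This bookkeeping can be checked numerically. Here is a sketch on synthetic data with known group slopes (\(1.5\) for the reference group and \(1.5 + 2 = 3.5\) for the other; the names x, g and y are made up):

```r
# Sketch: recovering group-specific slopes from an interaction model
set.seed(7)
x <- runif(120, 0, 10)
g <- rep(c(FALSE, TRUE), 60)
y <- 2 + 1.5 * x + 3 * g + 2 * x * g + rnorm(120, sd = 0.5)

fit <- lm(y ~ x + g + x:g)
b   <- coef(fit)   # (Intercept), x, gTRUE, x:gTRUE

slope_false <- b[["x"]]               # slope for g = FALSE, approx. 1.5
slope_true  <- b[["x"]] + b[["x:gTRUE"]]  # slope for g = TRUE, approx. 3.5
c(slope_false, slope_true)
```

The fitted coefficients recombine exactly as in the table above.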
We’re not having a lot of luck in model development so far (mainly because we are not being systematic), but let’s keep doggedly trying to improve our model! This time, we’ll add another continuous predictor, say cloudcover, which measures the cloud cover observed each day (apparently in ‘oktas’, or eighths of the sky, but the values seem more consistent with tenths of the sky).
RailTrail_trim %>%
summarise(min = min(cloudcover), max = max(cloudcover))
Let’s first investigate whether cloudcover and volume are associated:
RailTrail_trim %>%
ggplot(aes(x = cloudcover, y = volume)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
We have a reasonable association, but bear in mind that we are still not being systematic, as we haven’t subtracted off variation due to avgtemp and weekday. Nevertheless, let’s go ahead and add cloudcover to the regression
# add cloudcover before weekday for ggPredict() later
model5 <- lm(volume ~ avgtemp + cloudcover + weekday, data = RailTrail_trim)
summary(model5)
##
## Call:
## lm(formula = volume ~ avgtemp + cloudcover + weekday, data = RailTrail_trim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -204.803 -58.375 6.594 60.178 277.581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 200.1022 59.1826 3.381 0.00109 **
## avgtemp 5.2461 0.9536 5.502 3.81e-07 ***
## cloudcover -16.0115 3.3952 -4.716 9.21e-06 ***
## weekdayTRUE -47.9480 23.4061 -2.049 0.04356 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 100.3 on 86 degrees of freedom
## Multiple R-squared: 0.4022, Adjusted R-squared: 0.3814
## F-statistic: 19.29 on 3 and 86 DF, p-value: 1.185e-09
par(mfrow = c(2, 2))
plot(model5)
Yay, progress again! It looks like cloudcover is useful: \(r^2\) increases from \(0.25\) to \(0.40\), and the residual standard error value drops by approximately \(11\) users. The coefficient of cloudcover is statistically significant too.
The regression equation is now
\[\widehat{\textrm{volume}} = \textrm{intercept} + b_{\textrm{avgtemp}} \times \textrm{avgtemp} + b_{\textrm{weekday}} \times \textrm{weekday} + b_{\textrm{cloudcover}} \times \textrm{cloudcover}\]
Unfortunately, we start to run into problems visualising models with two or more continuous predictors due to the number of dimensions involved. This is about as far as we can go, because humans can really only visualise \(3\) dimensions, and here we now have one response dimension (volume) plus two continuous predictor dimensions (avgtemp and cloudcover), together with a categorical predictor (weekday).
A linear model involving two continuous predictors gives us a plane of best-fit. Here, with two continuous- and one categorical predictor, we go one step further and generate a parallel planes model similar in concept to the parallel slopes model we saw above.
Let’s see a \(2\)-dimensional representation of this model using facets (as generated automatically by ggPredict())
ggPredict(model5, colorn = 10, interactive = TRUE)
To interpret this plot, imagine in each facet that there is a cloudcover axis pointing into the plane of the screen, starting at \(0\) closest to you, and reaching \(10\) some distance behind the screen. The ten lines plotted show where the best fit plane lies at that value of cloudcover. So, in both cases, the plane would slope downwards ‘into’ the screen, i.e. increasing cloudcover leads to decreasing volume of use of the rail trail. We see the parallel planes model clearly, as we have two distinct but parallel planes (one in each facet) corresponding to weekday = TRUE and weekday = FALSE.
This is about as far as we can reasonably go; we’ll stop trying to visualise models beyond this level of complexity.
We will now think about how to interpret fitted coefficients in multiple linear regression. Let’s look again at the regression model we used above:
\[\widehat{\textrm{volume}} = \textrm{intercept} + b_{\textrm{avgtemp}} \times \textrm{avgtemp} + b_{\textrm{weekday}} \times \textrm{weekday} + b_{\textrm{cloudcover}} \times \textrm{cloudcover}\]
Now, when R adds weekday into the model, it generates a dummy variable weekdayTRUE which is either \(1\) or \(0\). If \(1\) for a row, then the fitted coefficient \(-47.9480\) is added to that value of the estimated response \(\widehat{\textrm{volume}}\). We would interpret this as ‘on average, weekdays experience volume that is \(48\) users lower than weekend days, for constant values of avgtemp and cloudcover’.
So much for categorical predictors, but what about the two continuous predictors avgtemp and cloudcover? How do we interpret their fitted coefficients? In the following way: the fitted coefficient of a continuous predictor gives the change in the response, on average, for a one-unit increase in that predictor, with all other predictors held constant.

By ‘constant’ here we mean ‘fixed’. So, for a specific example, the fitted coefficient \(b_{\textrm{avgtemp}} = 5.2461\) can be interpreted as follows: ‘a \(1\) degree Fahrenheit increase in average temperature raises the rail trail user volume by \(5.25\) users on average, with all other predictors held constant’.
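This interpretation can be demonstrated with predict(): increase one predictor by a single unit while holding the rest fixed, and the prediction changes by exactly that predictor’s coefficient. A sketch on synthetic data (the names temp, cloud and vol are illustrative):

```r
# Sketch: a one-unit change in one predictor, others held constant
set.seed(3)
d <- data.frame(temp = runif(80, 30, 90), cloud = runif(80, 0, 10))
d$vol <- 200 + 5 * d$temp - 15 * d$cloud + rnorm(80, sd = 50)

fit <- lm(vol ~ temp + cloud, data = d)

p1 <- predict(fit, newdata = data.frame(temp = 60, cloud = 5))
p2 <- predict(fit, newdata = data.frame(temp = 61, cloud = 5))  # temp + 1, cloud fixed

# The difference in predictions equals the fitted coefficient of temp
unname(p2 - p1)
```

This ‘holding everything else fixed’ logic is exactly how we read \(b_{\textrm{avgtemp}}\) and \(b_{\textrm{cloudcover}}\) above.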
Task - 5 mins
Add precip (daily precipitation in inches) into the regression model. Perform diagnostics, and if you find that precip is a significant predictor, interpret its fitted coefficient.
Solution
model6 <- lm(volume ~ avgtemp + cloudcover + weekday + precip, data = RailTrail_trim)
summary(model6)
##
## Call:
## lm(formula = volume ~ avgtemp + cloudcover + weekday + precip,
##     data = RailTrail_trim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -218.881 -58.924 3.428 57.285 266.250
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 155.0024 59.4574 2.607 0.010787 *
## avgtemp 5.8743 0.9485 6.193 2.02e-08 ***
## cloudcover -12.8294 3.4783 -3.688 0.000397 ***
## weekdayTRUE -45.8821 22.5943 -2.031 0.045412 *
## precip -117.5681 43.2329 -2.719 0.007928 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 96.72 on 85 degrees of freedom
## Multiple R-squared: 0.4501, Adjusted R-squared: 0.4242
## F-statistic: 17.39 on 4 and 85 DF, p-value: 1.849e-10
par(mfrow = c(2, 2))
plot(model6)

Looks promising! All the coefficients are statistically significant at a significance level of \(0.05\). We interpret the coefficient of precip as follows: ‘a \(1\) inch increase in daily precipitation lowers rail trail volume by approximately \(118\) users on average, with all other predictors held constant’.
Imagine we come up with the idea that there should be an interaction between avgtemp and precip, i.e. we think something along the lines that “days that are both hot and wet should lead to particularly low usage of the rail trail”. Frankly, that does sound like an unpleasant environment in which to run or cycle!
We can straightforwardly add this interaction in R formula notation as avgtemp:precip.
model7 <- lm(volume ~ avgtemp + cloudcover + weekday + precip + avgtemp:precip, data = RailTrail_trim)
summary(model7)
##
## Call:
## lm(formula = volume ~ avgtemp + cloudcover + weekday + precip +
## avgtemp:precip, data = RailTrail_trim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -218.775 -59.164 3.183 57.361 266.189
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 153.9280 63.9007 2.409 0.018189 *
## avgtemp 5.8964 1.0598 5.564 3.08e-07 ***
## cloudcover -12.8866 3.6984 -3.484 0.000786 ***
## weekdayTRUE -45.8527 22.7365 -2.017 0.046921 *
## precip -100.8213 353.3547 -0.285 0.776097
## avgtemp:precip -0.2335 4.8893 -0.048 0.962023
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 97.29 on 84 degrees of freedom
## Multiple R-squared: 0.4501, Adjusted R-squared: 0.4173
## F-statistic: 13.75 on 5 and 84 DF, p-value: 8.34e-10
par(mfrow = c(2, 2))
plot(model7)
We see, however, that inclusion of the interaction is not justified: its own fitted coefficient is not statistically significant, and it also renders precip an insignificant predictor. Not good.
How could we have investigated this interaction before we added it to the model? The conditional plot is helpful! We can obtain a conditional plot in R using the coplot() function as follows:
coplot(volume ~ avgtemp | precip,
data = RailTrail_trim)
We tell the function to plot volume against avgtemp, conditional on the value of precip using the vertical bar | to mean ‘conditional upon’, as in the earlier lessons on probability theory.
Think of this plot as showing ‘volume versus avgtemp given precip’.
You may also hear this called a ‘shingles’ or ‘roof tile’ plot, because the top graph resembles overlapping tiles on a roof (AKA shingles in some dialects). If we like we can add a best fit line to each panel using the panel = argument (this takes in a function, telling coplot() what to do in each panel). We can also control the degree of overlap of the shingles with the overlap = argument.
coplot(volume ~ avgtemp | precip,
panel = function(x, y, ...){
points(x, y)
abline(lm(y ~ x), col = "blue")
},
overlap = 0.2,
data = RailTrail_trim)
How do we interpret the results of coplot()? First, realise that the three scatter plots in the bottom part of the graph correspond to the three shingles in the top plot in the following order:
Coplots always go this way: the scatter plots, read left to right in each row and then from the bottom row to the top, correspond to the shingles from bottom to top. Each scatter plot shows \(x\) and \(y\) data for the \(z\) values in the corresponding shingle. So, for the plot above, we have:

* volume versus avgtemp for precip values in a narrow range from \(0\) inches to approximately \(0.01\) inches (i.e. essentially the days where it did not rain).
* volume versus avgtemp for precip values in a broader range from \(0\) inches to approximately \(0.16\) inches.
* volume versus avgtemp for precip values in a very broad range from approximately \(0.14\) inches to approximately \(1.5\) inches.

Why have the shingles ended up being the width that they are? R tries to arrange the shingles so that, as far as possible, they contain near equal numbers of data points.
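Behind the scenes, coplot() builds its shingles with the co.intervals() function, which aims for roughly equal counts of points per interval. A sketch with a synthetic precip-like variable (many dry days, as in the real data):

```r
# Sketch: how shingle intervals are chosen for a conditioning variable
set.seed(11)
precip <- c(rep(0, 40), runif(50, 0, 1.5))   # synthetic stand-in for RailTrail precip

# 3 intervals with 20% overlap, matching the coplot() call above;
# each row of the result is the (lower, upper) range of one shingle
iv <- co.intervals(precip, number = 3, overlap = 0.2)
iv
```

Intervals containing many tied zero values come out narrow, which is exactly why the first shingle above covers only the dry days.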
Now, here’s the key point. If we examine the scatter points and find that the slope of the best fit line varies systematically as the shingle range varies, this indicates a potentially significant interaction between the \(x\) predictor and the conditioning predictor!
Here’s a plot showing schematically what a coplot() for a significant interaction might look like
Fig. 1 Coplot showing a clear interaction between x and conditioning variable z. The lines’ slopes vary systematically from positive to negative as the z-range varies.
Alas, we can see no clear systematic change in the slopes of the best fit lines in our coplot(volume ~ avgtemp | precip,...), indicating that there is likely not a significant interaction between avgtemp and precip, as the regression diagnostics above confirmed for us.
Recap

* What does the \(+\) operator mean in formula notation? It means ‘add this predictor into the model with its own coefficient’.
* How do we include an interaction between two predictors? With the \(:\) operator.
* Conditional plots help us investigate potential interactions; which R function do we use to obtain such a plot? R provides the coplot() function for this purpose.